The dataset I’ll be using in this project is titled “Most Streamed Spotify Songs 2023”, uploaded to Kaggle by Nidula Elgiriyewithana. It was collected using the Spotify API and includes data on many different aspects of the songs, such as release date, track name, artist name, streams, and bpm. In this project, I’ll analyze some of the most common features present in popular songs on Spotify.
songs <- read.csv("spotify-2023.csv")
head(songs)
Before we begin analyzing the data, we’ll start by reordering the songs in order from most streams to least streams, and give them a corresponding key for how popular they are overall. We also notice at least one entry with no data for streams, so we’ll remove any entries without a numerical entry for streams. There also seems to be some unknown characters that could cause errors, so we’ll clean any special or unknown characters from the artist and track names. To help maintain readability, we’ll replace them with a question mark.
Finally, I have filtered for songs released after 2010. This is largely due to the fact that most of the songs in the dataset released before 2010 are holiday songs or “classic” songs that remain popular over time, such as Riptide by Vance Joy. These songs are generally singles, and filtering them out will allow for much more readable graphs and allow us to explore the popularity of more recent songs.
#arrange songs by number of streams and add ranking column
rankedSongs <- songs %>% mutate(streams=as.numeric(streams)) %>% arrange(desc(streams)) %>% mutate(ranking=row_number()) %>% relocate(25, .before = 1)
#filter entries with N/A streams
rankedSongs <- rankedSongs %>% filter(!is.na(streams), released_year>=2010)
#rename column for easier access
rankedSongs <- rankedSongs %>% rename("artist_name"="artist.s._name")
#clean special characters from artist names and track names
rankedSongs <- rankedSongs %>% mutate(artist_name=str_replace_all(artist_name,"[^[:alnum:]], ", "?")) %>% mutate(artist_name=gsub("�","?",artist_name))
rankedSongs <- rankedSongs %>% mutate(track_name=str_replace_all(track_name,"[^[:alnum:]], ", "?")) %>% mutate(track_name=gsub("�","?",track_name))
rankedSongs
With the data now in a more usable format, we’ll begin by exploring the most popular release dates. I’ll count the number of songs released in each month/year combination from 2010 to 2023, then graph them in a heat map to see what the most common release month was.
#Create new dataframe, count number of occurrences for each month/year combination
release_dates <- data.frame()
release_dates <- release_dates %>% add_column(year=NA, month=NA, num=NA)
for (i in 2010:2023) {
for (k in 1:12) {
release_dates
release_dates <- release_dates %>% add_row(
year=i,
month=k,
num=length(which(rankedSongs$released_year==i & rankedSongs$released_month == k)))
}
}
#add text for graph tooltips
release_dates <- release_dates %>%
mutate(text = paste0("x: ", month, "\n", "y: ", year, "\n", "Value: ", num))
#graph heatmap
date_heatmap <- release_dates %>% ggplot(aes(x=month,y=year,fill=num, text=text)) +
geom_tile() +
scale_fill_viridis_c() +
labs(x="Month", y="Year", fill = "# of Songs")
#create interactive graph
ggplotly(date_heatmap,tooltip="text") %>%
layout(title=list(text=paste0("Most Common Song Release Dates",'<br>','<sup>',"By Month and Year (2010-2023)", '</sup>')), margin = list(t = 60))
Most of the songs were released in the past few years, which makes sense. However, there are still a significant amount of songs in these rankings released between 2010 and 2020 that are still popular. We note that there is no data for songs released after July of 2023, which is due to when the data was collected. We do notice some strong peaks of songs released in May, October, and December of 2022, with May 2022 having the most songs at 75.
Looking at the dataset, we can quickly see that some artists appear numerous times, even in the top few rankings. To determine how common this is, we’ll create a barplot to visualize the artists of each song. Since we only want to see the artists who appear in the list the most in this graph, only artists who appear in the dataset three or more times will be included.
#count occurrences of each artist
artist_occurrences <- rankedSongs %>% group_by(artist_name) %>% summarise(num=n()) %>% arrange(desc(num))
artist_occurrences <- artist_occurrences %>% filter(num>=3) %>% mutate(id=row_number()) %>% relocate(3, .before = 1)
artist_occurrences
Even after filtering the data to artists with more than three songs, most artists fall within the 3 to 9 range, however a few artists have significantly more. We’ll graph the number of songs for the top ten artists to get a visual representation of these significant artists.
#Prepare graph information
graph_data <- artist_occurrences %>% filter(row_number(id) <= 10)
graph_data$discrete <- cut(graph_data$num, breaks = 10)
#Create bargraph of artists
artist_graph <- graph_data %>% ggplot(aes(y=reorder(artist_name,+num), x=num, fill=discrete)) +
geom_bar(stat="identity") +
scale_fill_viridis_d(guide="none", direction = 1) +
geom_text(aes(label=num), hjust=-0.2) +
labs(title="Top Ten Artists On Spotify 2023", subtitle="By Number of Popular Songs", x="Number Of Hit Songs",y="Artist")
artist_graph
The most notable takeaway from this data is that Taylor Swift has by far the largest number of hits, with a total of 34 songs. Most artists appear only once or twice in the data, and even BTS, who appears as number 10 in number of streams, only has 8 hit songs. This leads to the question - what is it about Taylor Swift’s songs that make them so popular?
To follow this line of questioning, I filtered two new datasets, one with only Taylor Swift’s hit songs, and the other excluding her songs.
hit_songs_tswift <- rankedSongs %>% filter(artist_name == "Taylor Swift")
hit_songs_no_tswift <- rankedSongs %>% filter(artist_name != "Taylor Swift")
head(hit_songs_tswift)
head(hit_songs_no_tswift)
I’ll then use these new datasets to graph some comparisons between the two.
taylor_dance_graph <- ggplot() +
# Non-Taylor data
geom_point(data=hit_songs_no_tswift, aes(x=ranking, y=(danceability_.), color='All Artists')) +
geom_hline(data=hit_songs_no_tswift, aes(yintercept=mean(danceability_.)), color="blue") +
annotate("text", x=32,y=70,label="Overall Avg.", color="blue") +
# Taylor Swift data
geom_point(data=hit_songs_tswift, aes(x=ranking, y=(danceability_.),color='Taylor Swift')) +
geom_hline(data=hit_songs_tswift, aes(yintercept=mean(danceability_.)), color="red") +
annotate("text", x=32,y=57,label="T. Swift Avg.", color="red") +
scale_x_reverse()+
labs(title="Danceability by Spotify Streams", subtitle="Taylor Swift vs. Overall", x="Ranking (by # of Streams)", y="Danceability %") +
scale_color_manual(name="Artist",
breaks=c('Taylor Swift', 'All Artists'),
values=c('Taylor Swift'='red', 'All Artists'='lightblue'))
taylor_dance_graph
taylor_energy_graph <- ggplot() +
# Non-Taylor data
geom_point(data=hit_songs_no_tswift, aes(x=ranking, y=(energy_.), color='All Artists')) +
geom_hline(data=hit_songs_no_tswift, aes(yintercept=mean(energy_.)), color="blue") +
annotate("text", x=32,y=68,label="Overall Avg.", color="blue") +
# Taylor Swift data
geom_point(data=hit_songs_tswift, aes(x=ranking, y=(energy_.), color='Taylor Swift')) +
geom_hline(data=hit_songs_tswift, aes(yintercept=mean(energy_.)), color="red") +
annotate("text", x=32,y=54,label="T. Swift Avg.", color="red") +
scale_x_reverse() +
labs(title="Song Energy by Spotify Streams", subtitle="Taylor Swift vs. Overall", x="Ranking (by # of Streams)", y="Energy %") +
scale_color_manual(name="Artist",
breaks=c('Taylor Swift', 'All Artists'),
values=c('Taylor Swift'='red', 'All Artists'='lightblue'))
taylor_energy_graph
taylor_speech_graph <- ggplot() +
# Non-Taylor data
geom_point(data=hit_songs_no_tswift, aes(x=ranking, y=speechiness_., color='All Artists')) +
geom_hline(data=hit_songs_no_tswift, aes(yintercept=mean(speechiness_.)), color="blue") +
annotate("text", x=32,y=13,label="Overall Avg.", color="blue") +
# Taylor Swift data
geom_point(data=hit_songs_tswift, aes(x=ranking, y=speechiness_., color='Taylor Swift')) +
geom_hline(data=hit_songs_tswift, aes(yintercept=mean(speechiness_.)), color="red") +
annotate("text", x=32,y=5,label="T. Swift Avg.", color="red") +
scale_x_reverse() +
labs(title="Speechiness by Spotify Streams", subtitle="Taylor Swift vs. Overall", x="Ranking (by # of Streams)", y="Speechiness %") +
scale_color_manual(name="Artist",
breaks=c('Taylor Swift', 'All Artists'),
values=c('Taylor Swift'='red', 'All Artists'='lightblue'))
taylor_speech_graph
taylor_bpm_graph <- ggplot() +
# Non-Taylor data
geom_point(data=hit_songs_no_tswift, aes(x=ranking, y=bpm, color='All Artists')) +
geom_hline(data=hit_songs_no_tswift, aes(yintercept=mean(bpm)), color="blue") +
annotate("text", x=32,y=118,label="Overall Avg.", color="blue") +
# Taylor Swift data
geom_point(data=hit_songs_tswift, aes(x=ranking, y=bpm, color='Taylor Swift')) +
geom_hline(data=hit_songs_tswift, aes(yintercept=mean(bpm)), color="red") +
annotate("text", x=32,y=131,label="T. Swift Avg.", color="red") +
scale_x_reverse() +
labs(title="BPM by Spotify Streams", subtitle="Taylor Swift vs. Overall", x="Ranking (by # of Streams)", y="BPM") +
scale_color_manual(name="Artist",
breaks=c('Taylor Swift', 'All Artists'),
values=c('Taylor Swift'='red', 'All Artists'='lightblue'))
taylor_bpm_graph
There don’t seem to be any particularly strong correlations between danceability, energy, speechiness or bpm and ranking, and Swift’s songs seem to be pretty evenly distributed throughout all four. It seems her higher ranking songs have a slight tendency to have more danceability, and a higher energy level. We can also note one outlier in the speechiness category - while most of her songs fall below the 20% mark, there is one song at 39%. This song is titled “Vigilante Shit”, which also has a lower BPM than average, at only 80 BPM.
With a quick Google search, we discover that Spotify generally pays $0.003 to $0.005 per stream, so we can use this to calculate each artist’s overall profit (from only the hit songs in the dataset) by multiplying this value with their number of streams. Since we have a range of values, we’ll average it and use $0.004 per stream.
#create a new dataframe, calculate total streams and profit, and assign them new columns
artist_profit <- rankedSongs %>%
select(artist_name, streams) %>%
group_by(artist_name) %>%
summarise(totalStreams = sum(streams), spotify_profit = (round(totalStreams*0.004)))
#merge the profit dataframe with occurrence dataframe
artist_profit <- merge(artist_profit, artist_occurrences, by = "artist_name") %>% rename("num_songs"="num") %>% arrange(desc(num_songs))
artist_profit
#prepare tooltip text
artist_profit <- artist_profit %>% mutate(text = paste("Artist: ", artist_name, "\nSpotify Streams: ", totalStreams, "\nTotal Profit ($USD): ", spotify_profit, "\nNumber of Hit Songs: ", num_songs, sep=""))
#graph streams and profit, with size = number of songs
graph_profit <- artist_profit %>% ggplot(aes(x=num_songs, y=spotify_profit, size=totalStreams, color=artist_name, text=text)) +
geom_point(alpha=0.7) +
scale_size(range = c(0.5, 10), name="Number of Hit Songs") +
scale_color_viridis_d(guide="none") +
labs(x="Number of Songs", y="Total Profit") +
theme(legend.position='none') +
annotate("text", x=30.5,y=105,label="Size = Total Spotify Streams", size=3)
#make interactive graph
ggplotly(graph_profit, tooltip="text") %>%
layout(title=list(text=paste0("Number of Hit Songs on Spotify vs. Profit",'<br>','<sup>',"By Artist (2010-2023)", '</sup>')), margin = list(t = 60))
What we find from this data is that while there seems to be no distinct pattern to what type of songs are more popular on Spotify, there are indeed artists who appear far more frequently than others. This popularity doesn’t necessarily translate to more profit, however - even though Taylor Swift has an overwhelming 34 hit songs, she actually only makes second place for overall profit, with The Weekend pulling ahead by about $500,000 with only 22 hit songs. In fact, there are a few artists who only have a few hit songs, such as Ed Sheeran, who places in #3 for profit with only 9 songs, above Harry Styles and Bad bunny with 17 and 19 songs respectively. These results are likely to due huge popularity of one or two songs from each artist, which contribute a majority of the streams, but also drive traffic to their other songs as more people discover the new artist. Artists such as Taylor Swift, however, have a more established, dedicated fanbase that provides a more even spread of streams among all her songs.